Abstract: This paper introduces and develops a novel variable importance score function in the context
of ensemble learning and demonstrates its appeal both theoretically and empirically. Our proposed score
function is simpler and more straightforward than its counterpart proposed in the context of random
forest, and by avoiding permutations, it is by design computationally more efficient than the random
forest variable importance function. Just like the random forest variable importance function, our score
handles both regression and classification seamlessly. One distinct advantage of our proposed score
is that it offers a natural cutoff at zero, with positive scores indicating importance and
significance, while negative scores are deemed indications of insignificance. A further advantage is that
our proposed score works well beyond ensembles of trees and can be used seamlessly
with any base learner in the random subspace learning context. Our examples, both simulated and real,
demonstrate that our proposed score competes mostly favorably with the random forest score.
1. Introduction


Consider a data set $\mathcal{D} = \{(x_1, y_1), \cdots, (x_n, y_n)\}$ where each $x_i$ is a p-dimensional vector of
attributes of potentially different types observable on some input space denoted here by $\mathcal{X}$, and the $y_i$ are
the responses taken from $\mathcal{Y}$. We shall consider various scenarios, but mainly the regression scenario with
$\mathcal{Y} = \mathbb{R}$ and the classification scenario with $\mathcal{Y} = \{1, 2, \cdots, K\}$. We consider the task of building
the estimator $\hat{f}(\cdot)$ of the true but unknown underlying $f$, and seek to build $\hat{f}(\cdot)$ such that the true error
(generalization error) is as small as possible. In this context, we shall use the average test error AVTE(·) as our
measure of predictive performance, namely


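$$\mathrm{AVTE}(\hat{f}) = \frac{1}{R} \sum_{r=1}^{R} \left\{ \frac{1}{m} \sum_{j=1}^{m} \ell\big(y_j^{(r)}, \hat{f}_r(x_j^{(r)})\big) \right\} \qquad (1.1)$$

where $\hat{f}_r$ is the estimator built on the training portion at the rth replication, $m$ is the size of the test set,
and $(x_j^{(r)}, y_j^{(r)})$ is the jth observation from the test set at the rth random replication of the split of the data.
Throughout this paper, we shall use the zero-one loss (1.2) for all our classification tasks, namely

$$\ell(y, \hat{f}(x)) = \mathbb{1}\{y \neq \hat{f}(x)\} \qquad (1.2)$$

For regression tasks, we shall use the squared error loss (1.3), namely

$$\ell(y, \hat{f}(x)) = (y - \hat{f}(x))^2 \qquad (1.3)$$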
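The average test error can be illustrated with a short Python sketch. It is a minimal illustration, not the author's code: the function names, the `fit` interface, the 30% test fraction and the ordinary least squares base estimator are all assumptions made for this example.

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """Zero-one loss: 1 when the prediction differs from the response."""
    return (np.asarray(y) != np.asarray(y_hat)).astype(float)

def squared_error_loss(y, y_hat):
    """Squared error loss."""
    return (np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) ** 2

def avte(X, y, fit, loss, R=50, test_fraction=0.3, seed=0):
    """Average test error over R random train/test splits.

    `fit(X_train, y_train)` must return a prediction function; this
    interface is hypothetical and chosen only for the sketch.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    m = max(1, int(round(test_fraction * n)))   # test-set size (assumed fraction)
    errors = []
    for _ in range(R):
        perm = rng.permutation(n)
        test, train = perm[:m], perm[m:]
        predict = fit(X[train], y[train])
        errors.append(loss(y[test], predict(X[test])).mean())
    return float(np.mean(errors))

def fit_ols(X_train, y_train):
    """Ordinary least squares with an intercept, via numpy.linalg.lstsq."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ beta

# Toy regression data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=200)
regression_avte = avte(X, y, fit_ols, squared_error_loss, R=20)
```

For a classification task the same `avte` function can be called with `zero_one_loss` and a classifier-fitting function in place of `fit_ols`.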


Besides seeking the optimal predictive estimator of $f$, we also seek to select the most important (useful) predictor
variables as a byproduct of our overall learning scheme. Indeed, while accurate prediction is very important in
and of itself, it is often desirable, or even crucial in some cases, to provide the added description of the importance
of the variables involved in the prediction task. The statistical literature is filled with thousands of papers on
variable selection and measurement of variable importance.



imsart-generic ver. 2014/07/30 file: perf-score-version-l.tex date: January 27, 2015 


Corresponding author 



Ernest Fokoue/PERF Score 




2. Main result 


We consider a framework with a p-dimensional input space with typical input vector $x = (x_1, \cdots, x_p)^\top$. We
also consider building different models with different subsets of the p original variables. Let
$\gamma = (\gamma_1, \cdots, \gamma_p)^\top$ denote the p-dimensional indicator such that

$$\gamma_j = \begin{cases} 1 & \text{if } x_j \text{ is active in the current model indexed by } \gamma \\ 0 & \text{otherwise.} \end{cases} \qquad (2.1)$$


Assume that we are given an ensemble (collection or aggregation) of models, say

$$\mathcal{H} = \{h(\cdot, \gamma^{(1)}), h(\cdot, \gamma^{(2)}), \cdots, h(\cdot, \gamma^{(B)})\} \qquad (2.2)$$

where $h(\cdot, \gamma^{(b)})$ denotes the function built with only those variables that are active in the bth model of the
ensemble (aggregation), and $\gamma^{(b)} = (\gamma_1^{(b)}, \cdots, \gamma_p^{(b)})$ with

$$\gamma_j^{(b)} = \begin{cases} 1 & \text{if } x_j \text{ is active in the } b\text{th model of the ensemble} \\ 0 & \text{otherwise.} \end{cases} \qquad (2.3)$$


For instance, we may consider a homogeneous ensemble, i.e., an ensemble in which all the functions are of the
same family, like the case where all the base learners are multiple linear regression (MLR) models differing by the
variables upon which they are built. Consider a score function $\mathrm{score}(h(\cdot, \gamma^{(b)}))$ used to assess the performance
of the model indexed by the variables active in $\gamma^{(b)}$. We propose a variable importance score in the form of a function
that measures the importance of a variable $x_j$ in terms of the reduction in average score

$$\mathrm{PERF}(x_j) = \frac{1}{B} \sum_{b=1}^{B} \mathrm{score}(h(\cdot, \gamma^{(b)})) - \frac{1}{B_j} \sum_{b=1}^{B} \gamma_j^{(b)} \, \mathrm{score}(h(\cdot, \gamma^{(b)})) \qquad (2.4)$$

where $B_j$ is the number of models containing the variable $x_j$, specifically $B_j = \sum_{b=1}^{B} \gamma_j^{(b)}$. In words,

$$\mathrm{PERF}(x_j) = \text{Average score over all models} - \text{Average score over all models with } x_j$$

Intuitively, $\mathrm{PERF}(x_j)$ measures the impact of variable $x_j$. In a way similar to the approach used by
sports writers to decide the MVP on a team or in a league, $\mathrm{PERF}(x_j)$ looks at the overall performance of the
whole ensemble and then, for each variable $x_j$, computes the direction and magnitude of the change to that overall
performance brought by its presence in models. If a variable $x_j$ is important, then its presence
in any model will cause that model to perform better, in the sense of having an error (score) lower than the common
average. The average score of all models containing an important variable $x_j$ should therefore be lower than the
overall average score.

• $|\mathrm{PERF}(x_j)|$ measures the magnitude of the importance/impact.

• $\mathrm{sign}(\mathrm{PERF}(x_j))$ measures the direction of the impact.

• If $\mathrm{sign}(\mathrm{PERF}(x_j)) = +1$ and $|\mathrm{PERF}(x_j)|$ is relatively large, then $x_j$ is an important variable.

• The score applies seamlessly to the large p, small n setting.

• All variables with $\mathrm{PERF}(x_j) < 0$ are unimportant and can be discarded.

• The PERF(·) score can be used whenever an ensemble $\mathcal{H}$ is available along with a suitable score function
for each base learner.

• The score works with any base learner and can be adapted to parametric, nonparametric and semi-parametric
models; one can imagine ensembles with any base learners as their atoms.

• A great advantage over the traditional variable importance score functions of Breiman (2001a) and
Breiman (2001b) is the clear cut-off at zero, in the sense that all variables with $\mathrm{PERF}(x_j) > 0$ are kept and
all those variables with $\mathrm{PERF}(x_j) < 0$ are thrown away.
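The definition of the PERF score and its sign-based reading can be checked on a toy ensemble. The following Python sketch is only an illustration: the function name `perf_scores`, the four scores and the indicator matrix are made up for the example and do not come from the paper.

```python
import numpy as np

def perf_scores(scores, gamma):
    """PERF(x_j) for every variable j.

    scores : shape (B,), score (error) of each model in the ensemble
    gamma  : shape (B, p), gamma[b, j] = 1 if x_j is active in model b
    """
    scores = np.asarray(scores, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    B_j = gamma.sum(axis=0)                    # number of models containing x_j
    overall = scores.mean()                    # average score over all models
    with np.errstate(invalid="ignore", divide="ignore"):
        with_j = (gamma * scores[:, None]).sum(axis=0) / B_j
    return overall - with_j                    # NaN when x_j is never selected

# Hypothetical ensemble of B = 4 models over p = 3 variables.
gamma = [[1, 1, 0],
         [1, 0, 1],
         [0, 1, 1],
         [1, 0, 0]]
scores = [1.0, 2.0, 5.0, 2.0]                  # made-up errors, lower is better

perf = perf_scores(scores, gamma)
keep = np.flatnonzero(perf > 0)                # cut-off at zero
```

Here the overall average score is 2.5. The models containing the first variable average 5/3, so its PERF is 5/6 and it is kept; the other two variables average 3.0 and 3.5, giving PERF values of -0.5 and -1.0, so they are discarded.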


2.1. PERF score via Random Subspace Learning 

A natural implementation of PERF(·) can be done using the ubiquitous bootstrap along with the random subspace
learning scheme. The out-of-bag (OOB) error in the bagging or random subspace learning context is a good
(in fact excellent) candidate score function, especially when the goal is the selection of variables that lead to the
lowest prediction error. The advantage of using the OOB error as the score lies in the fact that the score is obtained as
part of building the ensemble in the random subspace learning framework. Consider the training set
$\mathcal{D} = \{z_i = (x_i^\top, y_i), \ i = 1, \cdots, n\}$, where $x_i^\top = (x_{i1}, \cdots, x_{ip})$ and $y_i \in \mathcal{Y}$ are realizations of two
random variables X and Y respectively. Let $x_{i,\pi_j} = (x_{i1}, \cdots, x_{i\pi_j}, \cdots, x_{id})$. The permutation $\pi_j$ acts on the
jth column of the out-of-bag data matrix: it simply permutes the elements of that column.
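The action of the permutation on a single column of the out-of-bag data matrix can be sketched in Python as follows. This is a hedged illustration only: the helper name `permuted_oob_score`, the toy predictor and the squared error loss are assumptions for the example, not part of the paper.

```python
import numpy as np

def permuted_oob_score(predict, X_oob, y_oob, j, loss, rng):
    """Score of a fitted learner after permuting column j of the OOB matrix."""
    X_perm = X_oob.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # only column j is shuffled
    return float(loss(y_oob, predict(X_perm)).mean())

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

# Toy setting: the learner uses column 0 only, and fits the response exactly.
rng = np.random.default_rng(0)
X_oob = rng.normal(size=(50, 2))
y_oob = 2.0 * X_oob[:, 0]
predict = lambda X: 2.0 * X[:, 0]

base_score = float(squared_error(y_oob, predict(X_oob)).mean())
score_perm_0 = permuted_oob_score(predict, X_oob, y_oob, 0, squared_error, rng)
score_perm_1 = permuted_oob_score(predict, X_oob, y_oob, 1, squared_error, rng)
```

Permuting the column the learner relies on degrades the score, while permuting an unused column leaves it unchanged. This is the mechanism behind permutation-based importance, and it needs one extra prediction pass per variable.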


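Algorithm 1 PERF Score Estimate via Random Subspace Learning

1: procedure PERF Score(B)    ▷ Computing the PERF Score based on B base learners
2:     Choose a base learner $h(\cdot)$    ▷ e.g.: Trees, MLR
3:     Choose an estimation method    ▷ e.g.: Recursive Partitioning or OLS
4:     Initialize all the $\mathrm{PERF}(x_j)$ and $\mathrm{VI}(x_j)$ at zero
5:     for b = 1 to B do
6:         Draw with replacement from $\mathcal{D}$ a bootstrap sample $\mathcal{D}^{(b)} = \{z_1^{(b)}, \cdots, z_n^{(b)}\}$
7:         Draw without replacement from $\{1, \cdots, p\}$ a subset $\mathcal{V}^{(b)} = \{j_1^{(b)}, \cdots, j_d^{(b)}\}$ of d variables.
8:         Form the indicator vector $\gamma^{(b)} = (\gamma_1^{(b)}, \cdots, \gamma_p^{(b)})$ with

           $$\gamma_j^{(b)} = \begin{cases} 1 & \text{if } j \in \{j_1^{(b)}, \cdots, j_d^{(b)}\} \\ 0 & \text{otherwise.} \end{cases}$$

9:         Drop unselected variables from $\mathcal{D}^{(b)}$ so that $\mathcal{D}_{\mathrm{sub}}^{(b)}$ is d dimensional
10:        Build the bth base learner $h(\cdot, \gamma^{(b)})$ based on $\mathcal{D}_{\mathrm{sub}}^{(b)}$
11:        Compute the score of the bth base learner $h(\cdot, \gamma^{(b)})$    ▷ e.g. Out-of-bag error

           $$s^{(b)} = \mathrm{score}(h(\cdot, \gamma^{(b)})) = \frac{1}{|\bar{\mathcal{D}}^{(b)}|} \sum_{z_i \in \bar{\mathcal{D}}^{(b)}} \ell(y_i, h(x_i, \gamma^{(b)}))$$

           where $\bar{\mathcal{D}}^{(b)}$ denotes the out-of-bag observations of the bth bootstrap sample
12:        for $j \in \mathcal{V}^{(b)}$ do
13:            Generate the permutation $\pi_j$ of the jth column of $\bar{\mathcal{D}}^{(b)}$
14:            Compute the permutation impacted score

               $$s_{\pi_j}^{(b)} = \mathrm{score}_{\pi_j}(h(\cdot, \gamma^{(b)})) = \frac{1}{|\bar{\mathcal{D}}^{(b)}|} \sum_{z_i \in \bar{\mathcal{D}}^{(b)}} \ell(y_i, h(x_{i,\pi_j}, \gamma^{(b)}))$$

15:            Compute the bth instance of the importance of $x_j$
16:        end for
17:    end for
18:    Use the ensemble $\mathcal{H} = \{h(\cdot, \gamma^{(b)}), \ b = 1, \cdots, B\}$ to form the estimator

       $$\mathrm{PERF}(x_j) = \frac{1}{B} \sum_{b=1}^{B} \mathrm{score}(h(\cdot, \gamma^{(b)})) - \frac{1}{B_j} \sum_{b=1}^{B} \gamma_j^{(b)} \, \mathrm{score}(h(\cdot, \gamma^{(b)})) \qquad (2.5)$$

19: end procedure

$$\mathrm{VI}(x_j) = \ldots \qquad (2.6)$$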
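The permutation-free part of the procedure (bootstrap, random subspace, out-of-bag score, then the PERF estimator) can be sketched in Python. This is a minimal sketch under stated assumptions rather than the author's implementation: the base learner is taken to be ordinary least squares with an intercept, the score is the out-of-bag squared error, and the function name, the defaults `B=1000` and `d=3`, and the toy data are all hypothetical.

```python
import numpy as np

def perf_random_subspace(X, y, B=1000, d=3, seed=0):
    """Permutation-free PERF scores from a random subspace ensemble of OLS fits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    gamma = np.zeros((B, p))          # gamma[b, j] = 1 if x_j is active in model b
    scores = np.full(B, np.nan)       # out-of-bag score of each model
    for b in range(B):
        boot = rng.integers(0, n, size=n)               # bootstrap sample, with replacement
        oob = np.setdiff1d(np.arange(n), boot)          # out-of-bag observations
        if oob.size == 0:                               # nothing held out: skip this draw
            continue
        subset = rng.choice(p, size=d, replace=False)   # d variables, without replacement
        gamma[b, subset] = 1.0
        # Base learner (assumed): OLS with intercept on the selected variables only.
        A = np.column_stack([np.ones(n), X[boot][:, subset]])
        beta, *_ = np.linalg.lstsq(A, y[boot], rcond=None)
        A_oob = np.column_stack([np.ones(oob.size), X[oob][:, subset]])
        scores[b] = np.mean((y[oob] - A_oob @ beta) ** 2)   # OOB squared error
    used = ~np.isnan(scores)
    s, g = scores[used], gamma[used]
    B_j = g.sum(axis=0)               # number of models containing x_j
    with np.errstate(invalid="ignore", divide="ignore"):
        avg_with_j = (g * s[:, None]).sum(axis=0) / B_j
    return s.mean() - avg_with_j      # NaN if x_j was never selected

# Toy data (illustrative): only columns 2, 6 and 8 enter the response.
rng = np.random.default_rng(42)
X = 1.0 + rng.normal(size=(200, 10))
y = 1 + 2 * X[:, 2] + X[:, 6] + 3 * X[:, 8] + rng.normal(size=200)
perf = perf_random_subspace(X, y)
ranking = np.argsort(-perf)           # variables from most to least important
```

No permutation or extra prediction pass is needed: each model's out-of-bag score is computed once and reused for every variable it contains.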


3. Computational demonstrations 
3.1. Simulated Example 

The dataset in this example is simulated under different scenarios for the level of correlation among the
variables and the ratio of n to p. In this particular example, the true function is

$$f(x) = 1 + 2x_3 + x_7 + 3x_9$$

with $x \sim \mathrm{MVN}(\mathbf{1}_p, \Sigma_p)$ and $\epsilon \sim N(0, 1)$. Specifically, we simulate data by defining
$\rho \in [0, 1)$, then we generate our predictor variables using a multivariate normal distribution.

Fig 1. Variable score for simulated data with high correlation among the variables in low dimension high sample size setting.
(a) Permutation-free Variable Importance. (b) Permutation-based Variable Importance.

Throughout this paper, the multivariate Gaussian density will be denoted by $\phi_p(x; \mu, \Sigma)$


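$$\phi_p(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\} \qquad (3.1)$$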


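Furthermore, in order to study the effect of the correlation pattern, we simulate the data using a covariance
matrix $\Sigma$ parameterized by $\tau$ and $\rho$ and defined by $\tau\Sigma$ where $\Sigma = (\sigma_{ij})$ with $\sigma_{ij} = \rho^{|i-j|}$,

$$\Sigma = \Sigma(\tau, \rho) = \tau
\begin{pmatrix}
1 & \rho & \cdots & \rho^{p-2} & \rho^{p-1} \\
\rho & 1 & \rho & \cdots & \rho^{p-2} \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
\rho^{p-2} & \cdots & \rho & 1 & \rho \\
\rho^{p-1} & \rho^{p-2} & \cdots & \rho & 1
\end{pmatrix}$$

For simplicity, however, we use the first $\Sigma$ with $\tau = 1$ throughout this paper. For the remaining parameters, we
use $\rho \in \{0, 0.25, 0.75\}$ and $p \in \{17, 250\}$, with the same $n = 200$.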
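The simulation design can be sketched in Python as follows. This is a hedged sketch, not the author's code: the function names are hypothetical, and the mean vector of ones and the additive noise model y = f(x) + ε are assumptions of the example.

```python
import numpy as np

def ar1_covariance(p, rho, tau=1.0):
    """Covariance tau * (rho^{|i-j|}) for i, j = 1..p."""
    idx = np.arange(p)
    return tau * rho ** np.abs(idx[:, None] - idx[None, :])

def simulate(n=200, p=17, rho=0.25, tau=1.0, seed=0):
    """Draw (X, y) with multivariate normal predictors and additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    Sigma = ar1_covariance(p, rho, tau)
    X = rng.multivariate_normal(mean=np.ones(p), cov=Sigma, size=n)  # assumed mean of ones
    eps = rng.normal(size=n)
    # True function f(x) = 1 + 2 x_3 + x_7 + 3 x_9 (1-based indices in the text).
    y = 1 + 2 * X[:, 2] + X[:, 6] + 3 * X[:, 8] + eps
    return X, y

# The scenarios listed in the text: rho in {0, 0.25, 0.75}, p in {17, 250}, n = 200.
datasets = {(rho, p): simulate(n=200, p=p, rho=rho)
            for rho in (0.0, 0.25, 0.75) for p in (17, 250)}
```

With ρ = 0 the covariance reduces to the identity matrix (the zero-correlation scenario), and larger ρ makes neighbouring predictors more strongly correlated.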


4. Conclusion and Discussion 

We have presented a variable importance score function in the context of ensemble learning. Our proposed
score function is simpler and more straightforward than its counterpart proposed in the context of random
forest, and by avoiding permutations, it is by design computationally more efficient than the random forest
variable importance function. Just like the random forest variable importance function, our score handles both
regression and classification seamlessly. One distinct advantage of our proposed score is that it
offers a natural cutoff at zero, with positive scores indicating importance and significance, while
negative scores are deemed indications of insignificance. A further advantage is that our proposed score
works well beyond ensembles of trees and can be used seamlessly with any base learner in the
random subspace learning context. Our examples, both simulated and real, demonstrated that our proposed
score competes mostly favorably with the random forest score. In our future work, we will present and compare
the corresponding average test errors of the single models made up of the most important variables. We will also
provide theoretical proofs of the connection between our score function and the significance
of variables selected using existing criteria. It is also our plan to address the fact that sometimes the correlation
structure among the predictor variables obscures the ability of our proposed score to correctly identify some
significant variables.

Fig 2. Variable Importance Scores for simulated data with mild correlation among the variables in low dimension high sample size
setting. (a) Permutation-free Variable Importance. (b) Permutation-based Variable Importance. Horizontal axis: Variable.

Fig 3. Variable Importance Scores for simulated data with zero correlation among the variables in low dimension high sample size
setting. (a) Permutation-free Variable Importance. (b) Permutation-based Variable Importance. Horizontal axis: Variable.

Fig 4. Variable Importance Scores for the Attitude Data Set, for which n = 30 and p = 6.
(a) Permutation-free Variable Importance. (b) Permutation-based Variable Importance.

Fig 5. Variable Importance Scores for the Spam Detection Dataset where n = 200 and p = 7, and K = 2 classes.
(a) Permutation-free Variable Importance. (b) Permutation-based Variable Importance. Horizontal axis: Variable.

Fig 6. Variable Importance Scores for the Spam Detection Dataset where n = 4601 and p = 57, and K = 2 classes.
(a) Permutation-free Variable Importance. (b) Permutation-based Variable Importance.